Abstract

In this paper we address an issue that has been brought to the attention of the database
community with the advent of the Semantic Web, i.e. the issue of how ontologies (and
the semantics conveyed by them) can help solve typical database problems, through a better
understanding of KR aspects related to databases. In particular, we investigate this issue
from the ILP perspective by considering two database problems, (i) the definition of views
and (ii) the definition of constraints, for a database whose schema is also represented by
means of an ontology. Both can be reformulated as ILP problems and can benefit from
the expressive and deductive power of the KR framework $\mathcal{DL}$+log$^{\neg\vee}$. We
illustrate the application scenarios by means of examples.

KEYWORDS: Inductive Logic Programming, Relational Databases, Ontologies, Descrip- 
tion Logics, Hybrid Knowledge Representation and Reasoning Systems 



1 Motivation 

Inductive Logic Programming (ILP) has been historically concerned with the induction
of rules from examples for classification purposes (Nienhuys-Cheng and de Wolf 1997).
Due to the close relation between Logic Programming and Relational Databases
(Ceri et al. 1990), ILP has established itself as a major approach to Relational
Data Mining (Dzeroski and Lavrac 2001). Indeed, Datalog (Ceri et al. 1989) is
the most widely used Knowledge Representation (KR) framework in ILP. Conversely,
interesting extensions of Datalog such as Datalog$^{\neg\vee}$ (Eiter et al. 1997)
have attracted very little attention in ILP. Some effort has also been made to make
ILP better able to face the challenges posed by Relational Data Mining applications,
e.g. scalability (Blockeel et al. 1999). However, the actual added value of ILP with
respect to far more efficient approaches remains the use of prior conceptual
knowledge (also known as background knowledge, or BK for short) during the learning
process, which enables the induction of conceptually meaningful rules. Yet, the
BK in ILP is often not organized around a well-formed conceptual model. This
practice seems to ignore the latest achievements in conceptual modeling such as
ontologies.

In Artificial Intelligence, an ontology refers to an engineering artifact (more precisely,
one produced according to the principles of Ontological Engineering (Gomez-Perez et al. 2004)),
constituted by a specific vocabulary used to describe a certain reality, plus a set of
explicit assumptions regarding the intended meaning of the vocabulary words. This
set of assumptions usually has the form of a first-order logical (FOL) theory, where
vocabulary words appear as unary or binary predicate names, respectively called
concepts and relations. More formally, an ontology is a formal explicit specification
of a shared conceptualization for a domain of interest (Gruber 1993). Among
other things, this definition emphasizes the fact that an ontology has to be
specified in a language that comes with a formal semantics. Only by using such a
formal approach do ontologies provide the machine-interpretable meaning of concepts
and relations that is expected when using an ontology-based approach. Among
the formalisms proposed by Ontological Engineering, the most widely used at present are
Description Logics (DLs) (Baader et al. 2007). In particular, the advent of the Semantic
Web (Berners-Lee et al. 2001) has given a tremendous impulse to research
on DL-based ontology languages. Indeed the DL $\mathcal{SHIQ}$ (Horrocks et al. 2000)
has been the starting point for the definition of the W3C standard mark-up language
OWL (Horrocks et al. 2003). Note that DLs are decidable fragments of FOL
that are incomparable with Clausal Logics (CLs) as regards the expressive power
(Borgida 1996) and the semantics (Rosati 2005b). Yet, DLs and CLs can be combined
according to some limited forms of hybridization. E.g., $\mathcal{DL}$+log$^{\neg\vee}$ is a
general KR framework that allows for the tight integration of DLs and Datalog$^{\neg\vee}$
by imposing the condition of weak $\mathcal{DL}$-safeness on hybrid rules (Rosati 2006). We
argue that the adoption of such hybrid KR systems can help overcome the current
difficulties in accommodating ontologies in ILP.

In this paper we address an issue that has been brought to the attention of the
database community with the advent of the Semantic Web, i.e. the issue of how
ontologies (and the semantics conveyed by them) can help solve typical database
problems, through a better understanding of KR aspects related to databases. In
particular, we investigate this issue from the ILP perspective by considering two
database problems:

• the definition of views

• the definition of constraints

for a database whose schema is also represented by means of an ontology. Both
can be reformulated as ILP problems and can benefit from the expressive and
deductive power of the KR framework $\mathcal{DL}$+log$^{\neg\vee}$, mainly from its nonmonotonic
(NM) features. We illustrate the application scenarios by means of examples.
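The paper is organized as follows. Section 2 provides basic notions on DLs, a short
summary of KR research on the integration of DLs and CLs, and a brief introduction
to ILP. Section 3 introduces the syntax, semantics and reasoning of $\mathcal{DL}$+log$^{\neg\vee}$.
Section 4 and Section 5 define the ILP proposals for inducing database views and
database constraints, respectively, within the $\mathcal{DL}$+log$^{\neg\vee}$ framework. Section 6
surveys related work. Section 7 concludes the paper with final remarks.

^1 We prefer to use the name $\mathcal{DL}$+log$^{\neg\vee}$ instead of the original one $\mathcal{DL}$+log in order
to emphasize the Datalog$^{\neg\vee}$ component of the framework.

Table 1. Syntax and semantics of some typical DL constructs.

Construct                            Syntax                  Semantics
bottom (resp. top) concept           $\bot$ (resp. $\top$)   $\emptyset$ (resp. $\Delta^{\mathcal{I}}$)
atomic concept                       $A$                     $A^{\mathcal{I}} \subseteq \Delta^{\mathcal{I}}$
(abstract) simple role               $S$                     $S^{\mathcal{I}} \subseteq \Delta^{\mathcal{I}} \times \Delta^{\mathcal{I}}$
(abstract) individual                $a$                     $a^{\mathcal{I}} \in \Delta^{\mathcal{I}}$
concept                              $C$                     $C^{\mathcal{I}} \subseteq \Delta^{\mathcal{I}}$
role                                 $R$                     $R^{\mathcal{I}} \subseteq \Delta^{\mathcal{I}} \times \Delta^{\mathcal{I}}$
concept negation                     $\neg C$                $\Delta^{\mathcal{I}} \setminus C^{\mathcal{I}}$
concept intersection                 $C_1 \sqcap C_2$        $C_1^{\mathcal{I}} \cap C_2^{\mathcal{I}}$
concept union                        $C_1 \sqcup C_2$        $C_1^{\mathcal{I}} \cup C_2^{\mathcal{I}}$
value restriction                    $\forall R.C$           $\{x \in \Delta^{\mathcal{I}} \mid \forall y\ (x,y) \in R^{\mathcal{I}} \rightarrow y \in C^{\mathcal{I}}\}$
existential restriction              $\exists R.C$           $\{x \in \Delta^{\mathcal{I}} \mid \exists y\ (x,y) \in R^{\mathcal{I}} \wedge y \in C^{\mathcal{I}}\}$
at least number restriction          $\geq nR$               $\{x \in \Delta^{\mathcal{I}} \mid |\{y \mid (x,y) \in R^{\mathcal{I}}\}| \geq n\}$
at most number restriction           $\leq nR$               $\{x \in \Delta^{\mathcal{I}} \mid |\{y \mid (x,y) \in R^{\mathcal{I}}\}| \leq n\}$
at least qualif. number restriction  $\geq nR.C$             $\{x \in \Delta^{\mathcal{I}} \mid |\{y \in C^{\mathcal{I}} \mid (x,y) \in R^{\mathcal{I}}\}| \geq n\}$
at most qualif. number restriction   $\leq nR.C$             $\{x \in \Delta^{\mathcal{I}} \mid |\{y \in C^{\mathcal{I}} \mid (x,y) \in R^{\mathcal{I}}\}| \leq n\}$
role inversion                       $R^{-}$                 $\{(x,y) \in \Delta^{\mathcal{I}} \times \Delta^{\mathcal{I}} \mid (y,x) \in R^{\mathcal{I}}\}$
role intersection                    $R_1 \sqcap R_2$        $R_1^{\mathcal{I}} \cap R_2^{\mathcal{I}}$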



2 Background

2.1 Representing ontologies

DLs are a family of decidable FOL fragments that allow for the specification of
knowledge in terms of classes (concepts), instances (individuals), and binary relations
between instances (roles) (Borgida 1996). Complex concepts can be defined
from atomic concepts and roles by means of constructors. The syntax and semantics of
some typical DL constructs are reported in Table 1. E.g., concept descriptions in
the basic DL $\mathcal{AL}$ are formed using only the constructors of atomic negation,
concept conjunction, value restriction, and limited existential restriction. The DLs
$\mathcal{ALC}$ and $\mathcal{ALN}$ are members of the $\mathcal{AL}$ family. The former extends $\mathcal{AL}$ with
(arbitrary) concept negation (also called complement and equivalent to having both
concept union and full existential restriction), whereas the latter extends it with number
restrictions. The DL $\mathcal{ALCNR}$ adds to the constructors inherited from $\mathcal{ALC}$ and $\mathcal{ALN}$
a further one: role intersection. Conversely, in the DL $\mathcal{SHIQ}$ (Horrocks et al. 2000)
it is allowed to invert roles and to express qualified number restrictions of the form
$\geq nR.C$ and $\leq nR.C$ where $R$ is a simple role. Roles may also be transitive.

Table 2. Syntax and semantics of DL KBs.

Axiom or assertion                   Syntax                  Semantics
concept equivalence axiom            $C_1 \equiv C_2$        $C_1^{\mathcal{I}} = C_2^{\mathcal{I}}$
concept subsumption axiom            $C_1 \sqsubseteq C_2$   $C_1^{\mathcal{I}} \subseteq C_2^{\mathcal{I}}$
role equivalence axiom               $R_1 \equiv R_2$        $R_1^{\mathcal{I}} = R_2^{\mathcal{I}}$
role inclusion axiom                 $R_1 \sqsubseteq R_2$   $R_1^{\mathcal{I}} \subseteq R_2^{\mathcal{I}}$
concept assertion                    $C(a)$                  $a^{\mathcal{I}} \in C^{\mathcal{I}}$
role assertion                       $R(a,b)$                $(a^{\mathcal{I}}, b^{\mathcal{I}}) \in R^{\mathcal{I}}$
individual equality assertion        $a \approx b$           $a^{\mathcal{I}} = b^{\mathcal{I}}$
individual inequality assertion      $a \not\approx b$       $a^{\mathcal{I}} \neq b^{\mathcal{I}}$

A role (expression) is called complex if it contains any role operations other than
inversion, e.g. role intersection.

A DL knowledge base (KB) $\Sigma$ can state both is-a relations between concepts (axioms)
and instance-of relations between individuals (resp. pairs of individuals)
and concepts (resp. roles) (assertions or facts). Axioms form the so-called terminological
box (TBox) $\mathcal{T}$ whereas facts are contained in the so-called assertional box
(ABox) $\mathcal{A}$. A $\mathcal{SHIQ}$ KB also encompasses a role box (RBox) $\mathcal{R}$ which consists of a
finite set of role equivalence and role inclusion axioms. Therefore hierarchies can be
defined over roles as well as concepts. Transitivity of roles is also specified
by means of axioms. Thus, when a DL-based ontology language is adopted, an ontology
is nothing else than a TBox, possibly together with an RBox. If the ontology
is populated, it corresponds to a whole DL KB, i.e. one that also encompasses an ABox.
The semantics of DLs can be defined directly with set-theoretic formalizations as
shown in Table 2 or through a mapping to FOL as shown in (Borgida 1996). An
interpretation $\mathcal{I} = (\Delta^{\mathcal{I}}, \cdot^{\mathcal{I}})$ for a DL KB consists of a domain $\Delta^{\mathcal{I}}$ and a mapping
function $\cdot^{\mathcal{I}}$. Under the Unique Names Assumption (UNA) (Reiter 1980), individuals
are mapped to elements of $\Delta^{\mathcal{I}}$ such that $a^{\mathcal{I}} \neq b^{\mathcal{I}}$ if $a \neq b$. Yet UNA does not hold
by default in DLs. Thus individual equality (inequality) assertions may appear in
a DL KB (see Table 2). An interpretation $\mathcal{I}$ is a model of a KB $\Sigma = (\mathcal{T}, \mathcal{A})$ iff
it satisfies all axioms and assertions in $\mathcal{T}$ and $\mathcal{A}$. Also the KB represents many
different interpretations, i.e. all its models. This is coherent with the Open World
Assumption (OWA) that holds in FOL semantics. A DL KB is satisfiable if it has at
least one model. An ABox assertion $\alpha$ is a logical consequence of a KB $\Sigma$, written
$\Sigma \models \alpha$, if all models of $\Sigma$ are also models of $\alpha$.
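The set-theoretic semantics above can be made concrete on a small finite interpretation. The following is a minimal sketch in Python, not part of the original paper: the domain, the concept and role names, and the nested-tuple encoding of concept expressions are all assumptions made for illustration. It computes the extension of a concept expression and checks whether a subsumption axiom is satisfied by that single interpretation.

```python
# Hypothetical finite interpretation: a domain plus extensions of atomic names.
DOMAIN = {"bob", "mary", "paul", "john"}
CONCEPTS = {
    "PERSON": {"bob", "mary", "paul", "john"},
    "MALE": {"bob", "paul", "john"},
    "FEMALE": {"mary"},
}
ROLES = {"FATHER": {("john", "paul")}}


def ext(concept):
    """Extension of a concept expression encoded as a string or nested tuple."""
    if isinstance(concept, str):            # atomic concept A
        return CONCEPTS[concept]
    op = concept[0]
    if op == "not":                         # concept negation
        return DOMAIN - ext(concept[1])
    if op == "and":                         # concept intersection
        return ext(concept[1]) & ext(concept[2])
    if op == "or":                          # concept union
        return ext(concept[1]) | ext(concept[2])
    if op in ("some", "all"):
        role, filler = ROLES[concept[1]], ext(concept[2])
        if op == "some":                    # existential restriction
            return {x for x in DOMAIN if any((x, y) in role for y in filler)}
        # value restriction: every role successor of x lies in the filler
        return {x for x in DOMAIN
                if all(y in filler for (z, y) in role if z == x)}
    raise ValueError(f"unknown constructor: {op}")


def satisfies_subsumption(c1, c2):
    """The interpretation satisfies C1 subsumed-by C2 iff ext(C1) is a subset of ext(C2)."""
    return ext(c1) <= ext(c2)


print(satisfies_subsumption("FEMALE", ("not", "MALE")))
print(sorted(ext(("some", "FATHER", "MALE"))))
```

Note that this only checks one interpretation. Logical consequence, as defined above, quantifies over all models of the KB and is what tableau-based DL reasoners decide.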

The main reasoning task for a DL KB $\Sigma$ is the consistency check which tries to
prove the satisfiability of $\Sigma$. The consistency check is performed by applying decision
procedures mostly based on tableau calculus. Another well known reasoning
service in DLs is instance check, i.e., the check of whether an ABox assertion is
a logical implication of a DL KB. A more sophisticated version of instance check,
called instance retrieval, retrieves, for a DL KB $\Sigma$, all (ABox) individuals that
are instances of the given (possibly complex) concept expression $C$, i.e., all those
individuals $a$ such that $\Sigma$ entails that $a$ is an instance of $C$. In data-intensive
applications, querying KBs plays a central role. Instance retrieval is, in some respects,
a rather weak form of querying: although possibly complex concept expressions
are used as queries, we can only query for tree-like relational structures, i.e., a DL
concept cannot express arbitrary cyclic structures. The possibility of expressing
conjunctive queries (CQ) and unions of conjunctive queries (UCQ) is widely studied
in DLs. Let $P_C$ and $P_R$ be the alphabets of concept names and role names,
respectively. A Boolean UCQ over the alphabet $P_C \cup P_R$ is a FOL sentence of
the form $q_1 \vee \ldots \vee q_n$, where each $q_i$ is a conjunction $\exists \vec{X}.conj_i(\vec{X})$ of atoms whose
predicates are in $P_C \cup P_R$ and whose arguments are either constants or variables
from the tuple $\vec{X}$. A Boolean CQ corresponds to a Boolean UCQ in the case when
$n = 1$. The Boolean UCQ entailment problem in DLs is defined as follows: A KB
$\Sigma$ entails a UCQ $Q = q_1 \vee \ldots \vee q_n$, written as $\Sigma \models q_1 \vee \ldots \vee q_n$, if, for every
model $\mathcal{I}$ of $\Sigma$, there is some $i$ with $1 \leq i \leq n$ such that $q_i$ is satisfied in $\mathcal{I}$. Note
that instance check can be expressed as the entailment problem
for a Boolean CQ constituted by just one ground atom. The Boolean CQ/UCQ
containment problem^2 in DLs is defined as follows: Given a $\mathcal{DL}$-TBox $\mathcal{T}$, a Boolean
CQ $Q_1$ and a Boolean UCQ $Q_2$ over the alphabet $P_C \cup P_R$, $Q_1$ is contained in $Q_2$
with respect to $\mathcal{T}$, denoted by $\mathcal{T} \models Q_1 \subseteq Q_2$, iff, for every model $\mathcal{I}$ of $\mathcal{T}$, if $Q_1$ is
satisfied in $\mathcal{I}$ then $Q_2$ is satisfied in $\mathcal{I}$. This problem has been proved decidable for
many DLs, notably for the very expressive $\mathcal{SHIQ}$ (Glimm et al. 2008) and $\mathcal{SHOQ}$
(Glimm et al. 2008). Finally, when the UNA does not hold, it can be immediately
reduced to the Boolean UCQ entailment problem (Calvanese et al. 2008). In the
rest of the paper we shall consider DLs without UNA.
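To illustrate what it means for a Boolean CQ or UCQ to be satisfied in an interpretation, here is a minimal Python sketch. It is an illustration under assumptions, not the paper's procedure: atoms are encoded as `(predicate, arguments)` pairs, terms starting with an upper-case letter are treated as variables, and the function names are hypothetical. It checks satisfaction in one finite interpretation by brute force. Entailment, by contrast, requires this to hold in every model of the KB.

```python
from itertools import product


def is_variable(term):
    # Assumed convention: variables start with an upper-case letter.
    return term[:1].isupper()


def satisfies_cq(atoms, facts, domain):
    """True iff some assignment of domain elements to the variables
    turns every atom of the conjunction into a fact."""
    variables = sorted({t for _, args in atoms for t in args if is_variable(t)})
    for values in product(sorted(domain), repeat=len(variables)):
        binding = dict(zip(variables, values))
        grounded = {(p, tuple(binding.get(t, t) for t in args))
                    for p, args in atoms}
        if grounded <= facts:
            return True
    return False


def satisfies_ucq(disjuncts, facts, domain):
    """A Boolean UCQ is satisfied iff at least one of its CQs is."""
    return any(satisfies_cq(q, facts, domain) for q in disjuncts)


facts = {("FATHER", ("john", "paul")), ("MALE", ("paul",))}
domain = {"john", "paul"}
has_son = [("FATHER", ("X", "Y")), ("MALE", ("Y",))]
own_father = [("FATHER", ("X", "X"))]   # a cyclic pattern no DL concept expresses
print(satisfies_cq(has_son, facts, domain))
print(satisfies_ucq([own_father, has_son], facts, domain))
```

The second query shows the kind of cyclic structure mentioned above that falls outside instance retrieval but is expressible as a CQ.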

2.2 Integrating ontologies and relational databases 

The integration of ontologies and relational databases follows the tradition of KR
research on hybrid systems, i.e. those systems which are constituted by two or
more subsystems dealing with distinct portions of a single KB by performing specific
reasoning procedures (Frisch and Cohn 1991). The motivation for investigating
and developing such systems is to improve on two basic features of KR formalisms,
namely representational adequacy and deductive power, while preserving the other
crucial feature, i.e. decidability. Those KR systems that integrate ontologies and
relational databases will be referred to as DL-CL hybrid KR systems in the rest of
the paper. They implement different solutions to the problem of combining DLs and
CLs. Indeed DLs and CLs are FOL fragments that are incomparable as regards
expressiveness (Borgida 1996) and semantics (Rosati 2005a) but combinable at different
degrees of integration. The integration is said to be tight when a model of the hybrid
KB is defined as the union of two models, one for the DL part and one for the
CL part, which share the same domain. In particular, combining DLs with CLs in
a tight manner can easily lead to undecidability if the interaction scheme between
the DL and the CL part of a hybrid KB does not fulfill some condition of safeness
(Rosati 2005b). Indeed safeness makes it possible to resolve the semantic mismatch between
DLs and CLs, namely the OWA for DLs and the CWA for CLs^3. In the following
we shall briefly describe two exemplary cases of tightly-integrated DL-CL hybrid
KR systems: $\mathcal{AL}$-log (Donini et al. 1998) and Carin (Levy and Rousset 1998). The
former is safe whereas the latter is not.

^2 This problem was called existential entailment in (Levy and Rousset 1998).

$\mathcal{AL}$-log (Donini et al. 1998) is a hybrid KR system that integrates $\mathcal{ALC}$ (Schmidt-Schauss and Smolka 1991)
and Datalog (Ceri et al. 1989). In particular, variables occurring in the body of
rules may be constrained with $\mathcal{ALC}$ concept assertions to be used as 'typing constraints'.
This makes rules applicable only to explicitly named objects. A further
restriction is that only Datalog atoms are allowed in rule heads. Reasoning for
$\mathcal{AL}$-log knowledge bases is based on constrained SLD-resolution, i.e. an extension
of SLD-resolution with a tableau calculus for $\mathcal{ALC}$ to deal with constraints. Constrained
SLD-resolution is decidable and runs in single non-deterministic exponential
time. Constrained SLD-refutation is a complete and sound method for answering
ground queries, i.e. conjunctions of ground Datalog atoms and $\mathcal{ALC}$ concept
assertions.

A comprehensive study of the effects of combining DLs and CLs can be found in
(Levy and Rousset 1998). Here the family Carin of hybrid languages is presented.
Special attention is devoted to the DL $\mathcal{ALCNR}$. The results of the study can
be summarized as follows: (i) answering CQs over $\mathcal{ALCNR}$ TBoxes is decidable,
(ii) query answering in a logic obtained by extending $\mathcal{ALCNR}$ with non-recursive
Datalog rules, where both concepts and roles can occur in rule bodies, is also decidable,
as it can be reduced to answering a UCQ, (iii) if rules are recursive, query
answering becomes undecidable, (iv) decidability can be regained by disallowing
certain combinations of constructors in the logic, and (v) decidability can be regained
by requiring rules to be role-safe, where at least one variable from each role
literal must occur in some non-DL-atom. As in $\mathcal{AL}$-log, query answering is decided
using constrained resolution and a modified version of tableau calculus.



2.3 Learning rules with ILP 

Inductive Logic Programming (ILP) was born at the intersection between Logic
Programming and Concept Learning (Muggleton 1990). From Logic Programming
it has borrowed the KR framework, i.e. Horn Clausal Logic (HCL). From Concept
Learning it has inherited the inferential mechanisms for induction, the most prominent
of which is generalization. Concept Learning is concerned with the problem of
automatically inducing the general definition of some concept (called target), given
examples labeled as instances or noninstances of the concept. In ILP the target is
the predicate whose definition is returned by the inductive learning process as a hypothesis.
The definition may consist of one or more clauses. A distinguishing feature
of ILP with respect to other forms of Concept Learning is the use of prior knowledge
of the domain of interest, called background knowledge (BK). Therefore, induction
with ILP generalizes from individual instances/observations in the presence of BK,
finding valid hypotheses. Validity depends on the underlying setting.

^3 Note that the OWA and CWA have a strong influence on the results of reasoning.



2.3.1 Settings 

At present, there exist several formalizations of induction in ILP that can be classified
according to the following two orthogonal dimensions: the scope of induction
(discrimination vs characterization) and the representation of observations (ground
definite clauses vs ground unit clauses) (De Raedt and Dehaspe 1997). Discriminant
induction aims at inducing hypotheses with discriminant power as required in
tasks such as classification where observations encompass both positive and negative
examples. Characteristic induction is more suitable for finding regularities in
a data set. This corresponds to learning from positive examples only. For a thorough
discussion of differences between discriminant and characteristic induction see
(Michalski 1983). The second dimension affects the notion of coverage, i.e. the condition
under which a hypothesis explains/confirms an observation. In learning from
entailment (also called normal or explanatory ILP setting), hypotheses are clausal
theories, observations are ground definite clauses, and a hypothesis covers an observation
if the hypothesis logically entails the observation (Frazier and Pitt 1993).
In learning from interpretations (also called nonmonotonic or confirmatory ILP
setting), hypotheses are clausal theories, observations are Herbrand interpretations
(ground unit clauses) and a hypothesis covers an observation if the observation
is a model for the hypothesis (De Raedt and Dzeroski 1994). Summing up, when
learning from entailment with the aim of discrimination, a hypothesis is valid (or
correct) if it logically entails all positive examples and none of the negative examples.
The former condition of validity is called completeness, whereas the latter is
referred to as consistency. If the scope of induction is characterization, the condition
of consistency is dropped from the notion of validity due to the absence
of negative examples. The two settings for the case of learning from interpretations
can be defined similarly.
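The notions of completeness and consistency in learning from entailment can be sketched as follows. This Python fragment is a simplified illustration, not the paper's method: it assumes that hypotheses are definite Datalog rules, that the BK is a set of ground facts, and that examples are ground atoms (rather than arbitrary ground definite clauses), so that entailment can be checked by naive forward chaining. All predicate names and helper functions are hypothetical.

```python
def is_variable(term):
    # Assumed convention: variables start with an upper-case letter.
    return term[:1].isupper()


def match(atom, fact, binding):
    """Extend `binding` so that `atom` becomes `fact`, or return None."""
    (pred, args), (fpred, fargs) = atom, fact
    if pred != fpred or len(args) != len(fargs):
        return None
    binding = dict(binding)
    for a, f in zip(args, fargs):
        if is_variable(a):
            if binding.setdefault(a, f) != f:
                return None
        elif a != f:
            return None
    return binding


def consequences(rules, facts):
    """Least Herbrand model of a definite Datalog program (naive iteration)."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            bindings = [{}]
            for atom in body:
                bindings = [b2 for b in bindings for f in derived
                            if (b2 := match(atom, f, b)) is not None]
            for b in bindings:
                new = (head[0], tuple(b.get(t, t) for t in head[1]))
                if new not in derived:
                    derived.add(new)
                    changed = True
    return derived


def validity(hypothesis, background, positives, negatives):
    """Return (complete, consistent) for a hypothesis together with the BK."""
    model = consequences(hypothesis, background)
    complete = all(e in model for e in positives)
    consistent = not any(e in model for e in negatives)
    return complete, consistent


background = {("parent", ("ann", "bob")), ("parent", ("bob", "carl"))}
hypothesis = [(("grandparent", ("X", "Z")),
               [("parent", ("X", "Y")), ("parent", ("Y", "Z"))])]
print(validity(hypothesis, background,
               positives=[("grandparent", ("ann", "carl"))],
               negatives=[("grandparent", ("ann", "bob"))]))
```

For characteristic induction the `negatives` list would simply be empty, so only completeness matters.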

2.3.2 Techniques 

In Concept Learning, and thus in ILP, generalization is traditionally viewed as search
through a partially ordered space of inductive hypotheses (Mitchell 1982). According
to this vision, an inductive hypothesis is a clausal theory and the induction of
a single clause requires (i) structuring, (ii) searching and (iii) bounding the space
of clauses (Nienhuys-Cheng and de Wolf 1997).

First we focus on (i) by clarifying how the algebraic notion of ordering can be
applied to clauses. A generality relation allows for determining which one, between
two clauses, is more general than the other. It defines a pre-order (or quasi-order)
on the set of clauses, i.e. a partially-ordered set of equivalence classes. One such
ordering is $\theta$-subsumption (Plotkin 1970): Given two clauses $C$ and $D$, we say that
$C$ $\theta$-subsumes $D$ if there exists a substitution $\theta$ such that $C\theta \subseteq D$.^4 Given the
usefulness of BK, orders have been proposed that reckon with it. Among them is
relative subsumption (Plotkin 1971): Given two clauses $C$ and $D$ and a clausal
theory $\mathcal{K}$, we say that $C$ subsumes $D$ relative to $\mathcal{K}$ if there exists a substitution
$\theta$ such that $\mathcal{K} \models \forall(C\theta \Rightarrow D)$. Also, generalized subsumption (Buntine 1988) is
of interest to this paper: Given two definite clauses $C$ and $D$ standardized apart^5
and a definite program $\mathcal{K}$, we say that $C$ subsumes $D$ w.r.t. $\mathcal{K}$ iff there exists
a ground substitution $\theta$ for $C$ such that (i) $head(C)\theta = head(D)\sigma$ and (ii) $\mathcal{K} \cup
body(D)\sigma \models body(C)\theta$ where $\sigma$ is a Skolem substitution^6 for $D$ with respect to
$\{C\} \cup \mathcal{K}$. In the general case, generalized subsumption is undecidable and does not
introduce a lattice^7 on a set of clauses. Because of these problems, $\theta$-subsumption
is more frequently used in ILP systems. Yet for Datalog generalized subsumption
is decidable and admits a least general generalization.
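The definition of $\theta$-subsumption translates directly into a small backtracking search. The Python sketch below is an illustration only: the encoding of a clause as a set of `(sign, predicate, arguments)` literals, the upper-case convention for variables, and the function names are assumptions. It looks for a substitution $\theta$ over the variables of $C$ such that every literal of $C\theta$ occurs in $D$. The terms of $D$ are left untouched, which presupposes that $C$ and $D$ share no variables.

```python
def is_variable(term):
    # Assumed convention: variables start with an upper-case letter.
    return term[:1].isupper()


def extend(theta, args, target_args):
    """Extend theta so that args*theta equals target_args, or return None."""
    theta = dict(theta)
    for a, t in zip(args, target_args):
        if is_variable(a):
            if theta.setdefault(a, t) != t:
                return None
        elif a != t:
            return None
    return theta


def theta_subsumes(c, d):
    """Return a substitution theta with c*theta a subset of d, or None."""
    literals = sorted(c)                      # fixed order for backtracking

    def search(i, theta):
        if i == len(literals):
            return theta
        sign, pred, args = literals[i]
        for dsign, dpred, dargs in d:
            if (sign, pred, len(args)) != (dsign, dpred, len(dargs)):
                continue
            new_theta = extend(theta, args, dargs)
            if new_theta is not None:
                result = search(i + 1, new_theta)
                if result is not None:
                    return result
        return None

    return search(0, {})


# C: h(X) <- p(X,Y)          D: h(a) <- p(a,b), q(b)
C = {(True, "h", ("X",)), (False, "p", ("X", "Y"))}
D = {(True, "h", ("a",)), (False, "p", ("a", "b")), (False, "q", ("b",))}
print(theta_subsumes(C, D))   # a witness substitution, since C is more general
print(theta_subsumes(D, C))   # None: D does not theta-subsume C
```

The search is exponential in the worst case, which is consistent with $\theta$-subsumption being decidable but NP-complete. Generalized subsumption would additionally require an entailment test against the program $\mathcal{K}$.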

Once structured according to a generality order, the space of hypotheses can
be searched (ii) by means of refinement operators. A refinement operator is a
function which computes a set of specializations or generalizations of a clause
according to whether a top-down or a bottom-up search is performed. The two
kinds of refinement operator have therefore been called downward and upward,
respectively. A good refinement operator should satisfy certain desirable properties
(van der Laag 1995). We shall illustrate these properties for the case of downward
refinement operators, but analogous conditions are required to hold for the
upward ones as well. Ideally, a downward refinement operator should compute only
a finite set of specializations of each clause, otherwise it will be of limited practical
use. When it meets this condition, it is called locally finite. Furthermore, it
should be complete: every specialization should be reachable by a finite number of
applications of the operator. Finally, it is better to compute only proper specializations
of a clause, for otherwise repeated application of the operator might get
stuck in a sequence of equivalent clauses, without ever achieving any real specialization.
Operators that satisfy all these conditions simultaneously are called ideal.
It has been shown that ideal refinement operators exist neither for full nor for
Horn clausal languages ordered by either subsumption or the stronger orders (e.g.
implication).

In order to define a refinement operator for full clausal languages, it is necessary to
drop one of the three properties of idealness. Since local finiteness and completeness
are usually considered the most important among these properties, this means that
locally finite and complete, but improper refinement operators can be defined for
full clausal languages. On the other hand, in order to retain all three properties
of idealness, it seems that the only possibility is to restrict the search space. Hence,
the definition of refinement operators is usually coupled with the specification of a
declarative bias for bounding the space of clauses (iii). Bias concerns anything which
constrains the search for theories, e.g. a language bias specifies syntactic constraints
on the clauses in the search space. One such constraint is connectedness: A clause
$C$ is connected if each variable occurring in $head(C)$ also occurs in $body(C)$. The
constraint of linkedness is also widely used: A definite clause $C$ is linked if each
literal $l_i \in C$ is linked. A literal $l_i \in C$ is linked if at least one of its terms is linked.
A term $t$ in some literal $l_i \in C$ is linked with linking-chain of length 0, if $t$ occurs
in $head(C)$, and with linking-chain of length $d + 1$, if some other term in $l_i$ is linked
with linking-chain of length $d$. The link-depth of a term $t$ in $l_i$ is the length of the
shortest linking-chain of $t$.

^4 This definition relies on the set notation for clauses.

^5 Two clauses $C$ and $D$ are said to be standardized apart if they have no variables in common.

^6 Let $\mathcal{B}$ be a clausal theory and $C$ be a clause. Let $X_1, \ldots, X_n$ be all the variables appearing
in $C$, and $a_1, \ldots, a_n$ be distinct constants (individuals) not appearing in $\mathcal{B}$ or $C$. Then the
substitution $\{X_1/a_1, \ldots, X_n/a_n\}$ is called a Skolem substitution for $C$ w.r.t. $\mathcal{B}$.

^7 A lattice is a partially ordered set (also called a poset) in which any two elements have a unique
supremum (the elements' least upper bound) and an infimum (greatest lower bound).
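The language-bias constraints just defined are easy to check mechanically. The Python sketch below is an illustration under assumptions: a clause is given as the argument tuple of its head plus a list of `(predicate, arguments)` body literals, variables start with an upper-case letter, and the function names are hypothetical. It also simplifies the definition slightly by recording one link-depth per term (the minimum over all literals in which the term occurs), computed by breadth-first propagation from the head terms.

```python
def link_depths(head_args, body):
    """Map every linked term to the length of its shortest linking-chain."""
    depth = {t: 0 for t in head_args}       # head terms have chain length 0
    frontier = set(head_args)
    d = 0
    while frontier:
        reached = set()
        for _, args in body:
            if frontier & set(args):        # literal shares a term at depth d
                for t in args:
                    if t not in depth:
                        depth[t] = d + 1
                        reached.add(t)
        frontier = reached
        d += 1
    return depth


def is_linked(head_args, body):
    """A clause is linked if every body literal has at least one linked term."""
    depth = link_depths(head_args, body)
    return all(any(t in depth for t in args) for _, args in body)


def is_connected(head_args, body):
    """Connectedness as defined above: head variables also occur in the body."""
    body_terms = {t for _, args in body for t in args}
    return all(t in body_terms for t in head_args if t[:1].isupper())


# h(X) <- p(X,Y), q(Y,Z), r(W)
body = [("p", ("X", "Y")), ("q", ("Y", "Z")), ("r", ("W",))]
print(link_depths(("X",), body))
print(is_linked(("X",), body))       # r(W) shares no term with the rest
print(is_linked(("X",), body[:2]))
```

A refinement operator can use such checks to discard candidate clauses before testing coverage, for example by bounding the maximum link-depth.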

3 Integrating Ontologies and Databases with $\mathcal{DL}$+log$^{\neg\vee}$

The KR framework of $\mathcal{DL}$+log$^{\neg\vee}$ (Rosati 2006) allows for the tight integration
of DLs (Baader et al. 2007) and Datalog$^{\neg\vee}$ (Eiter et al. 1997). More precisely, it
allows a $\mathcal{DL}$ KB to be extended with Datalog$^{\neg\vee}$ rules according to the so-called
weak safeness condition as shown in the following.

3.1 Syntax

Formulas in $\mathcal{DL}$+log$^{\neg\vee}$ are built upon three mutually disjoint predicate alphabets:
an alphabet $P_C$ of concept names, an alphabet $P_R$ of role names, and an alphabet
$P_D$ of Datalog predicates. We call a predicate $p$ a DL-predicate if either $p \in P_C$
or $p \in P_R$. Then, we denote by $\mathcal{N}$ a countably infinite alphabet of constant names.
An atom is an expression of the form $p(\vec{X})$, where $p$ is a predicate of arity $n$ and $\vec{X}$
is an $n$-tuple of variables and constants. If no variable symbol occurs in $\vec{X}$, then $p(\vec{X})$
is called a ground atom (or fact). If $p \in P_C \cup P_R$, the atom is called a DL-atom,
while if $p \in P_D$, it is called a Datalog atom.

Definition 1 

Given a description logic $\mathcal{DL}$, a $\mathcal{DL}$+log$^{\neg\vee}$ KB $\mathcal{B}$ is a pair $(\Sigma, \Pi)$, where $\Sigma$ is a
$\mathcal{DL}$ KB and $\Pi$ is a set of Datalog$^{\neg\vee}$ rules, where each rule $R$ has the form

$$p_1(\vec{X}_1) \vee \ldots \vee p_n(\vec{X}_n) \leftarrow r_1(\vec{Y}_1), \ldots, r_m(\vec{Y}_m), s_1(\vec{Z}_1), \ldots, s_k(\vec{Z}_k), \text{not } u_1(\vec{W}_1), \ldots, \text{not } u_h(\vec{W}_h) \qquad (1)$$

with $n, m, k, h \geq 0$, each $p_i(\vec{X}_i)$, $r_j(\vec{Y}_j)$, $s_l(\vec{Z}_l)$, $u_k(\vec{W}_k)$ is an atom and:

• each $p_i$ is either a DL-predicate or a Datalog predicate;

• each $r_j$, $u_k$ is a Datalog predicate;

• each $s_l$ is a DL-predicate;

• (Datalog safeness) every variable occurring in $R$ must appear in at least one
of the atoms $r_1(\vec{Y}_1), \ldots, r_m(\vec{Y}_m), s_1(\vec{Z}_1), \ldots, s_k(\vec{Z}_k)$;

• (weak $\mathcal{DL}$-safeness) every head variable of $R$ must appear in at least one of
the atoms $r_1(\vec{Y}_1), \ldots, r_m(\vec{Y}_m)$.

We remark that the condition of weak $\mathcal{DL}$-safeness allows for the presence of
variables that only occur in DL-atoms in the body of $R$. This condition makes it possible to
overcome the main representational limits of the safe approaches while keeping the
integration scheme decidable. Indeed, the notion of $\mathcal{DL}$-safeness proposed in
(Motik et al. 2005) can be expressed as follows: every variable of $R$ must appear in
at least one of the atoms $r_1(\vec{Y}_1), \ldots, r_m(\vec{Y}_m)$. Therefore, $\mathcal{DL}$-safeness forces every
variable of $R$ to occur also in the Datalog atoms in the body of $R$. This rules out the
possibility of expressing CQs and UCQs. By weakening the $\mathcal{DL}$-safeness condition,
this possibility can be enabled. For these reasons, $\mathcal{DL}$+log$^{\neg\vee}$ is located between
$\mathcal{AL}$-log and Carin along the expressivity line.
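The three safeness conditions discussed above differ only in which body atoms must cover which variables, so they can be compared side by side. The following Python sketch is illustrative and makes several assumptions: a rule is passed as four lists of `(predicate, arguments)` atoms (head, positive Datalog body, DL body, negated body), variables start with an upper-case letter, and the function names are hypothetical.

```python
def variables(atoms):
    # Assumed convention: variables start with an upper-case letter.
    return {t for _, args in atoms for t in args if t[:1].isupper()}


def check_safeness(head, datalog_body, dl_body, negated_body):
    """Evaluate the safeness conditions for one hybrid rule."""
    every = (variables(head) | variables(datalog_body)
             | variables(dl_body) | variables(negated_body))
    datalog_vars = variables(datalog_body)
    return {
        # every variable occurs in some positive body atom (Datalog or DL)
        "datalog_safe": every <= datalog_vars | variables(dl_body),
        # every head variable occurs in some positive Datalog body atom
        "weakly_dl_safe": variables(head) <= datalog_vars,
        # every variable occurs in some positive Datalog body atom
        "dl_safe": every <= datalog_vars,
    }


# man(X) <- enrolled(X,c3,pt), FATHER(X,Y): Y occurs only in a DL-atom.
print(check_safeness(
    head=[("man", ("X",))],
    datalog_body=[("enrolled", ("X", "c3", "pt"))],
    dl_body=[("FATHER", ("X", "Y"))],
    negated_body=[],
))
```

In the rule used here the variable `Y` occurs only in the DL-atom, so the rule is weakly DL-safe without being DL-safe, which is exactly the extra freedom that lets rule bodies contain conjunctive queries over the ontology.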

Without loss of generality, we can assume that in a $\mathcal{DL}$+log$^{\neg\vee}$ KB $(\Sigma, \Pi)$ all
constants occurring in $\Sigma$ also occur in $\Pi$.

Example 1 

Let us consider a $\mathcal{DL}$+log$^{\neg\vee}$ KB $\mathcal{B}$ (adapted from (Rosati 2006)) integrating the
following DL-KB $\Sigma$ (ontology about persons)

[A1] PERSON $\sqsubseteq$ $\exists$FATHER$^{-}$.MALE
[A2] MALE $\sqsubseteq$ PERSON
[A3] FEMALE $\sqsubseteq$ PERSON
[A4] FEMALE $\sqsubseteq$ $\neg$MALE

MALE(Bob)

PERSON(Mary)

PERSON(Paul)

FATHER(John,Paul)

and the following Datalog$^{\neg\vee}$ program $\Pi$ (database about students):

[R1] boy(X) $\leftarrow$ enrolled(X,c1,ft), PERSON(X), not girl(X)

[R2] girl(X) $\leftarrow$ enrolled(X,c2,ft), PERSON(X)

[R3] boy(X) $\vee$ girl(X) $\leftarrow$ enrolled(X,c3,ft), PERSON(X)

[R4] FEMALE(X) $\leftarrow$ girl(X)

[R5] MALE(X) $\leftarrow$ boy(X)

[R6] man(X) $\leftarrow$ enrolled(X,c3,pt), FATHER(X,Y)
enrolled(Paul,c1,ft)
enrolled(Mary,c1,ft)
enrolled(Mary,c2,ft)
enrolled(Bob,c3,ft)
enrolled(John,c3,pt)

encompassing rules that mix DL-literals and Datalog-literals. The rule [R3], e.g.,
says that: If X is a PERSON enrolled in the course c3 as a full-time student (ft), then
X is either a boy or a girl. The rule [R6] says that: If X is a FATHER (of some Y)
enrolled in the course c3 as a part-time student (pt), then X is a man. Notice that
the variable Y in [R6] is weakly-safe but not DL-safe, since Y does not occur in any
Datalog literal of [R6].

3.2 Semantics

For $\mathcal{DL}$+log$^{\neg\vee}$ two semantics have been defined: a FOL semantics and a NM
semantics. The FOL semantics does not distinguish between head atoms and negated
body atoms. Thus, the rule (1) is equivalent to:

$$p_1(\vec{X}_1) \vee \ldots \vee p_n(\vec{X}_n) \vee u_1(\vec{W}_1) \vee \ldots \vee u_h(\vec{W}_h) \leftarrow r_1(\vec{Y}_1), \ldots, r_m(\vec{Y}_m), s_1(\vec{Z}_1), \ldots, s_k(\vec{Z}_k) \qquad (2)$$

The NM semantics is based on the stable model semantics of Datalog$^{\neg\vee}$. According
to it, DL-predicates are still interpreted under OWA, while Datalog predicates
are interpreted under CWA. Notice that, under both semantics, entailment can be
reduced to satisfiability, since it is possible to express constraints in the Datalog
program. In particular, it is immediate to verify the following theorem on ground
query answering (Rosati 2006).

Theorem 1 

Given a $\mathcal{DL}$+log$^{\neg\vee}$ KB $(\Sigma, \Pi)$ and a ground atom $\alpha$, $(\Sigma, \Pi) \models \alpha$ iff $(\Sigma, \Pi \cup \{\leftarrow \alpha\})$
is unsatisfiable.

Analogously, CQ answering can be reduced to satisfiability in Datalog$^{\neg\vee}$; more
precisely, it can be performed by means of multiple satisfiability tests. Consequently,
Rosati (2006) concentrates on the satisfiability problem in $\mathcal{DL}$+log$^{\neg\vee}$ KBs. It
has been shown that, when the rules belong to Datalog$^{\vee}$ (i.e., contain no
negated atoms), the above two semantics are equivalent with respect to the satisfiability
problem. In particular, FOL-satisfiability can always be reduced (in linear
time) to NM-satisfiability by rewriting rules from the form (1) to the form (2).
Hence, only the satisfiability problem under the NM semantics is treated in depth in
(Rosati 2006).
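The rewriting from form (1) to form (2) amounts to moving every negated body atom into the head as an additional disjunct, which takes time linear in the size of the rule. A minimal Python sketch follows. The dictionary encoding of a rule and the function name are assumptions made for illustration, and atoms are kept as plain strings.

```python
def to_fol_form(rule):
    """Rewrite a rule of form (1) into form (2): each atom under 'not'
    becomes a head disjunct, and the positive body is left unchanged."""
    return {
        "head": list(rule["head"]) + list(rule["negated"]),
        "datalog_body": list(rule["datalog_body"]),
        "dl_body": list(rule["dl_body"]),
        "negated": [],
    }


# Rule [R1] of Example 1: boy(X) <- enrolled(X,c1,ft), PERSON(X), not girl(X)
r1 = {
    "head": ["boy(X)"],
    "datalog_body": ["enrolled(X,c1,ft)"],
    "dl_body": ["PERSON(X)"],
    "negated": ["girl(X)"],
}
print(to_fol_form(r1))
```

Applied to [R1], the result has the same shape as [R3]: a disjunctive head `boy(X)`, `girl(X)` and no negation, which reflects that the FOL semantics does not treat `not` as default negation.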

Example 2 

With reference to Example 1, it can be easily verified that all NM-models for $\mathcal{B}$
satisfy the following ground atoms:

1. boy(Paul) (since rule [R1] is always applicable for {X/Paul} and acts
like a default rule, which can be read as follows: if X is a person enrolled in
course c1, then X is a boy, unless we know for sure that X is a girl);

2. girl(Mary) (since rule [R2] is always applicable for {X/Mary});

3. boy(Bob) (since rule [R3] is always applicable for {X/Bob}, and, by rule [R4],
the conclusion girl(Bob) is inconsistent with $\Sigma$);

4. MALE(Paul) (due to rule [R5] and conclusion 1);

5. FEMALE(Mary) (due to rule [R4] and conclusion 2).

Notice that B |=wMFEMALE(Mary) , while E ^fol FEMALE(Mary) . In other words, 
adding rules has indeed an effect on the conclusions one can draw about DL- 
predicates. Moreover, such an effect also holds under the FOL semantics of I?£-|-LOG- 
KBs, since it can be verified that B |=i?oLFEMALE(Mary) in this case. 
NMSAT-DL+log(B)

1. satisfiable = false

2. if there exists a partition (G_P, G_N) of gr_p(Π) such that

3.    (a) Π(G_P, G_N) has a stable model and

4.    (b) T ⊭ CQ(A ∪ G_P) ⊆ UCQ(G_N)

5. then satisfiable = true

6. endif

return satisfiable

Figure 1. The algorithm NMSAT-DL+log

3.3 Reasoning 

The problem statement of NM-satisfiability for finite DL+log^{¬∨} KBs relies on the
aforementioned Boolean CQ/UCQ containment problem for the DL part and on the
so-called DL-grounding of the Datalog^{¬∨} component. In particular, DL-grounding
is an adaptation of the grounding operation used in stable model semantics to the
DL+log^{¬∨} case.

Given a DL+log^{¬∨} KB B = (Σ, Π), we denote by C_Π the set of constants
occurring in Π. The DL-grounding of Π, denoted as gr_p(Π), is a set of Boolean
CQs obtained by grounding all and only the DL-parts of rule bodies and the
DL-atoms appearing in rule heads in Π with respect to the constants in C_Π. Note that
grounding in gr_p(Π) is partial, since the variables that only occur in DL-atoms in
the body of rules are not replaced by constants in gr_p(Π). Similarly to gr_p(Π), we
define the partial grounding of Π on C_Π, denoted as pgr(Π, C_Π), as the program
obtained from Π by grounding with the constants in C_Π all variables except for the
existential variables of rules that only occur in DL-atoms. Finally, given a partition
(G_P, G_N) of gr_p(Π), we denote by Π(G_P, G_N) the ground Datalog^{¬∨} program
obtained from pgr(Π, C_Π) by taking into account the two sets G_P and G_N so that
no DL-predicate occurs in such a program.

Let G be a set of Boolean CQs. Then, we denote by CQ(G) (resp. UCQ(G))
the Boolean CQ (resp. UCQ) corresponding to the conjunction (resp. disjunction)
of all the Boolean CQs in G. The algorithm NMSAT-DL+log for deciding
NM-satisfiability of DL+log^{¬∨} KBs has a very simple structure (see Figure 1). It
guesses a partition (G_P, G_N) of gr_p(Π) that is consistent with the DL-KB Σ =
(T, A) (Boolean CQ/UCQ containment problem) and such that Π(G_P, G_N) has a
stable model. More details can be found in (Rosati 2006).
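
Condition (a) of the algorithm, the existence of a stable model for a ground program, can be illustrated with a brute-force check. The following is only a minimal sketch under simplifying assumptions that are not part of the original algorithm: the program is ground and normal (no disjunction in heads), atoms are plain strings, and the function names `least_model`, `is_stable` and `stable_models` are hypothetical. It applies the standard Gelfond-Lifschitz reduct to each candidate set of atoms.

```python
from itertools import chain, combinations

# A ground normal rule is a triple (head, positive_body, negative_body);
# atoms are strings and both bodies are frozensets of atoms.


def least_model(definite_rules):
    """Least Herbrand model of a ground definite program (no negation)."""
    model = set()
    changed = True
    while changed:
        changed = False
        for head, pos in definite_rules:
            if pos <= model and head not in model:
                model.add(head)
                changed = True
    return model


def is_stable(program, candidate):
    """Gelfond-Lifschitz test: candidate equals the least model of the reduct."""
    # Drop rules whose negative body is contradicted by the candidate,
    # then delete the negative literals from the remaining rules.
    reduct = [(h, pos) for h, pos, neg in program if not (neg & candidate)]
    return least_model(reduct) == set(candidate)


def stable_models(program):
    """Enumerate stable models by trying every subset of the atoms (exponential)."""
    atoms = sorted({h for h, _, _ in program}
                   | {a for _, pos, neg in program for a in pos | neg})
    subsets = chain.from_iterable(
        combinations(atoms, k) for k in range(len(atoms) + 1))
    return [set(s) for s in subsets if is_stable(program, set(s))]


# Toy program:  p <- not q.   q <- not p.
program = [
    ("p", frozenset(), frozenset({"q"})),
    ("q", frozenset(), frozenset({"p"})),
]
print(stable_models(program))
```

The enumeration is exponential in the number of atoms and is meant only to make the guess-and-check structure concrete; condition (b), which requires a DL reasoner for query containment, is not modelled here.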

The decidability of reasoning, thus of ground query answering, in DL+log^{¬∨}
depends on the decidability of the Boolean CQ/UCQ containment problem in DL.

Theorem 2

For any DL, satisfiability of DL+log^{¬∨} KBs (under both FOL and NM semantics)
is decidable iff Boolean CQ/UCQ containment is decidable in DL (Rosati 2006).

From Theorem 2 and from previous results on query answering and query
containment in DLs, the decidability of reasoning follows for several instantiations
of DL+log^{¬∨}. In all these decidable cases, ground queries can be answered by
applying NMSAT-DL+log.

The complexity of reasoning in DL+log^{¬∨} depends on the specific DL chosen
for instantiating the framework. We refer the reader to (Rosati 2006) for the
analysis of some cases.

4 Inducing Database Views in DL+log^{¬} with ILP

In this section we consider the problem of defining a new view in a database whose
schema is partly represented by an ontology. We suppose that there are tuples
known to belong to the view as well as tuples known not to belong to the view. Cast
in the DL+log^{¬} framework, this problem boils down to the problem of building
DL+log^{¬} rules defining a Datalog predicate p which stands for the view name.
Tuples are ground Datalog facts that are true for p if they belong to the view, false
otherwise. The database problem of interest can be reformulated as the following
problem of discriminant induction.

Definition 2 
Given: 

• a Datalog database Π and a DL ontology Σ integrated into a DL+log^{¬}
KB B (background theory);

• a Datalog predicate p (target predicate);

• a set O of ground Datalog facts that are either true or false for p (examples);
and

• a set L of constraints on the form of DL+log^{¬} definitions for p (language
of hypotheses)

the problem of defining the view of name p is to induce a set H ⊂ L (hypothesis) of
DL+log^{¬} rules from O and B such that H explains O by taking B into account.

We assume that the background theory B in Definition 2 is a DL+log^{¬} KB
which consists of an intensional part K (i.e., the TBox T plus the set Π_R of rules)
and an extensional part F (i.e., the ABox A plus the set Π_F of facts). Also we denote
by P_C(B), P_R(B), and P_D(B) the sets of concept, role and Datalog predicate
names occurring in B, respectively. Note that p ∉ P_D(B).

Example 3 

Throughout this section we shall consider a database Π in the form of the following
Datalog^{¬} program:

famous(Mary)
famous(Paul)
famous(Joe)
scientist(Joe)

containing also the rule

[R1] RICH(X) ← famous(X), not scientist(X)

linking the database to the ontology Σ expressed as the following DL KB:

[A1] RICH ⊓ UNMARRIED ⊑ ∃WANTS-TO-MARRY⁻.⊤
[A2] WANTS-TO-MARRY ⊑ LOVES

UNMARRIED(Mary)

UNMARRIED(Joe)

Note that Π and Σ can be integrated into a DL+log^{¬} KB B (adapted from
(Rosati 2006)) that concerns the individuals Mary, Joe, and Paul and builds upon
the alphabets P_C(B) = {RICH/1, UNMARRIED/1}, P_R(B) = {WANTS-TO-MARRY/2, LOVES/2},
and P_D(B) = {famous/1, scientist/1}.

The language L of hypotheses in Definition 2 must allow for the generation of
DL+log^{¬} rules starting from three disjoint alphabets P_C(L) ⊆ P_C(B), P_R(L) ⊆
P_R(B), and P_D(L) ⊆ P_D(B). Also we distinguish between P_D^+(L) and P_D^-(L)
in order to specify which Datalog predicates can occur in positive and negative
literals, respectively. More precisely, we consider DL+log^{¬} rules of the form

p(X) ← r_1(Y_1), ..., r_m(Y_m), s_1(Z_1), ..., s_k(Z_k), not u_1(W_1), ..., not u_h(W_h)

where the unique literal p(X) in the head is formed out of a Datalog predicate
p which represents the target predicate. Note that the conditions of linkedness
and connectedness usually assumed in ILP are guaranteed by the conditions of
Datalog safeness and weak DL safeness valid in DL+log^{¬}.

Example 4 

Suppose that the Datalog predicate happy is the target and the set P_D^+(L^happy) ∪
P_C(L^happy) ∪ P_R(L^happy) = {famous/1, RICH/1, LOVES/2, WANTS-TO-MARRY/2}
provides the building blocks for the language L^happy. The following DL+log^{¬}
rules

R_1^happy   happy(X) ← famous(X)

R_2^happy   happy(X) ← famous(X), RICH(X)

R_3^happy   happy(X) ← famous(X), LOVES(Y,X)

R_4^happy   happy(X) ← famous(X), WANTS-TO-MARRY(Y,X)

belonging to L^happy can be considered definitions for the target predicate happy.

The set O of observations in Definition 2 contains facts of the kind p(a_i) where
p is the target predicate and a_i is a tuple of individuals occurring in the ABox A.
We assume B ∩ O = ∅. Furthermore, the description of each observation o_i ∈ O
is in the background theory and may be incomplete due to the inherent nature
of DL+log^{¬}. Therefore, the normal ILP setting is the most appropriate to the
learning problem in hand and can be extended to DL+log^{¬} as follows.

Definition 3

Let R ∈ L be a DL+log^{¬} rule, B a DL+log^{¬} KB, p the target predicate, and
o_i = p(a_i) ∈ O a ground Datalog fact. We say that R covers o_i under entailment
w.r.t. B iff B ∪ R ⊨ p(a_i).

Note that the coverage test can be reduced to query answering in DL+log^{¬} KBs
which in turn can be reformulated as a satisfiability problem of the KB.
Example 5

The rule R_4^happy mentioned in Example 4 covers the observation o_Mary = happy(Mary)
because B ∪ R_4^happy ⊨ happy(Mary). Indeed, all NM-models for B' = B ∪ R_4^happy
satisfy:

• famous(Mary), since it is in B;

• ∃WANTS-TO-MARRY⁻.⊤(Mary), due to the axiom [A1] and to the fact that both
RICH(Mary) and UNMARRIED(Mary) hold in every model of B'. In particular,
RICH(Mary) holds because of [R1];

• happy(Mary), due to the above conclusions and to the rule R_4^happy. Indeed,
since ∃WANTS-TO-MARRY⁻.⊤(Mary) holds in every model of B', it follows that
in every model there exists a constant x such that WANTS-TO-MARRY(x,Mary)
holds in the model; consequently, from R_4^happy it follows that happy(Mary)
also holds in the model.

Note that R_4^happy does not cover the observations o_Joe = happy(Joe) and o_Paul =
happy(Paul). More precisely, B' ⊭ happy(Joe) because scientist(Joe) holds in
every model of B', thus making the rule [R1] not applicable for {X/Joe}, therefore
RICH(Joe) is not derivable. Finally, B' ⊭ happy(Paul) because UNMARRIED(Paul) is
not forced to hold in every model of B', therefore ∃WANTS-TO-MARRY⁻.⊤(Paul) is
not forced by [A1] to hold in every such model.

It can be proved that also R_3^happy covers only o_Mary, while R_1^happy covers all
the three observations and R_2^happy covers o_Mary and o_Paul only.
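
The Datalog side of this coverage argument can be reproduced with a few lines of code. This is only a sketch under a strong simplification: it evaluates rule [R1] with negation as failure over the ground facts of Example 3 and then checks the body of the candidate rule happy(X) ← famous(X), RICH(X). The DL axioms [A1] and [A2], which require open-world reasoning over existential restrictions, are deliberately left out, so the sketch says nothing about rules that mention LOVES or WANTS-TO-MARRY.

```python
# Ground facts of the Datalog part of the running example.
famous = {"Mary", "Paul", "Joe"}
scientist = {"Joe"}

# [R1] RICH(X) <- famous(X), not scientist(X), evaluated under the closed-world
# assumption: "not scientist(X)" succeeds when scientist(X) is not among the facts.
rich = {x for x in famous if x not in scientist}

# Candidate rule happy(X) <- famous(X), RICH(X): individuals whose body holds.
covered = {x for x in famous if x in rich}

print(sorted(rich))
print(sorted(covered))
```

Because scientist is a purely extensional predicate here, the program is stratified and has a single stable model, which is why a direct set comprehension is enough for this fragment.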

In order to support the induction of DL+log^{¬} rules with ILP techniques, the
language L of hypotheses needs to be equipped with a generality order ≽ so that
(L, ≽) is a search space. Therefore, the next two subsections, Section 4.1 and Section
4.2, are devoted to suggested techniques for structuring and searching the hypothesis
space, respectively. Finally, Section 4.3 sketches an ILP algorithm employing
these techniques to solve the original problem of inducing database views.



4.1 The hypothesis space

The definition of a generality order for hypotheses in L must consider the
peculiarities of DL+log^{¬}. One issue arises from the presence of NAF literals (i.e.,
negated Datalog literals) both in the background theory and in the language of
hypotheses. As pointed out in (Sakama 2001), rules in normal logic programs are
syntactically regarded as Horn clauses by viewing the NAF-literal not p(X) as an
atom not_p(X) with the new predicate not_p. Then any result obtained in ILP on
Horn logic programs is directly carried over to normal logic programs. Assuming
one such treatment of NAF literals, we propose to adapt generalized subsumption
(Buntine 1988) to the case of DL+log^{¬} rules and provide a characterization of
the resulting generality order, denoted by ≽_K, that relies on the reasoning tasks
known for DL+log^{¬∨} and from which a test procedure can be derived.
Definition 4

Let R_1, R_2 ∈ L be two DL+log^{¬} rules standardized apart, K a DL+log^{¬} KB,
and σ a Skolem substitution for R_2 with respect to {R_1} ∪ K. We say that R_1
is more general than R_2 w.r.t. K, denoted by R_1 ≽_K R_2, iff there exists a ground
substitution θ for R_1 such that (i) head(R_1)θ = head(R_2)σ and (ii) K ∪ body(R_2)σ ⊨
body(R_1)θ. We say that R_1 is strictly more general than R_2 w.r.t. K, denoted by
R_1 ≻_K R_2, iff R_1 ≽_K R_2 and R_2 ⋡_K R_1. We say that R_1 is equivalent to R_2 w.r.t.
K, denoted by R_1 ≡_K R_2, iff R_1 ≽_K R_2 and R_2 ≽_K R_1.

Note that condition (ii) is a variant of the Boolean CQ/UCQ containment
problem because body(R_2)σ and body(R_1)θ are both Boolean CQs. The difference
between (ii) and the original formulation of the problem is that K encompasses not
only a TBox but also a set of rules. Nonetheless this variant can be reduced to
the satisfiability problem for finite DL+log^{¬} KBs. Indeed the skolemization of
body(R_2) makes it possible to reduce the Boolean CQ/UCQ containment problem to a CQ
answering problem. Due to the aforementioned link between CQ answering and
satisfiability, checking (ii) can be reformulated as proving that the KB (T, Π_R ∪
body(R_2)σ ∪ {← body(R_1)θ}) is unsatisfiable. Once reformulated this way, (ii) can
be solved by applying the algorithm NMSAT-DL+log.

Example 6 

Let us consider the hypotheses

R_1^happy   happy(A) ← famous(A)

R_2^happy   happy(X) ← famous(X), RICH(X)

reported in Example 4 up to variable renaming. We want to check whether
R_1^happy ≽_K R_2^happy holds. Let σ = {X/a} be a Skolem substitution for R_2^happy
with respect to K ∪ R_1^happy and θ = {A/a} a ground substitution for R_1^happy.
Both conditions of Definition 4 are immediately verified. Thus, R_1^happy ≽_K R_2^happy.
Since the converse does not hold, we can say that R_1^happy ≻_K R_2^happy. Analogously, it can
be proved that R_1^happy ≻_K R_3^happy and R_1^happy ≻_K R_4^happy. Also, it turns
out that R_2^happy is incomparable under ≽_K with R_3^happy and R_4^happy. Finally,
it can be proved that R_3^happy ≻_K R_4^happy. In particular, the condition (ii) K ∪
{famous(a), WANTS-TO-MARRY(b,a)} ⊨ {famous(a), LOVES(b,a)} is nothing else
than a ground query answering problem in DL+log^{¬}. The entailment is guaranteed
by the axiom [A2].

It can be proved that ≽_K is a decidable quasi-order (i.e. it is a reflexive and
transitive relation) for DL+log^{¬} rules. In particular, the decidability of ≽_K follows
from the decidability of DL+log^{¬}.
4.2 A refinement operator

As pointed out in Section 4.1, the space (L, ≽_K) is a quasi-ordered set, therefore it
can be searched by refinement operators. In the following, we define a downward
refinement operator for a DL+log^{¬} language.

Definition 5 

Let L be a DL+log^{¬} language of hypotheses built out of the three finite and
disjoint alphabets P_C(L), P_R(L), and P_D^+(L) ∪ P_D^-(L), and

p(X) ← r_1(Y_1), ..., r_m(Y_m), s_1(Z_1), ..., s_k(Z_k), not u_1(W_1), ..., not u_h(W_h)

be a rule R belonging to L. We define a downward refinement operator ρ^{¬} for
(L, ≽_K) such that the set ρ^{¬}(R) contains all R' ∈ L that can be obtained from R
by applying one of the following refinement rules:

⟨AddDataLit_B+⟩ body(R') = body(R) ∪ {r_{m+1}(Y_{m+1})} if

1. r_{m+1} ∈ P_D^+(L)

2. r_{m+1}(Y_{m+1}) ∉ body(R)

⟨AddDataLit_B−⟩ body(R') = body(R) ∪ {not u_{h+1}(W_{h+1})} if

1. u_{h+1} ∈ P_D^-(L)

2. u_{h+1}(W_{h+1}) ∉ body(R)

⟨AddOntoLit_B⟩ body(R') = body(R) ∪ {s_{k+1}(Z_{k+1})} if

1. s_{k+1} ∈ P_C(L) ∪ P_R(L)

2. it does not exist any s_l ∈ body(R) such that s_{k+1} ⊑ s_l

⟨SpecOntoLit_B⟩ body(R') = (body(R) \ {s_l(Z_l)}) ∪ s'_l(Z_l) if

1. s'_l ∈ P_C(L) ∪ P_R(L)

2. s'_l ⊑ s_l

All the rules of ρ^{¬} are correct, i.e. the R's obtained by applying any of the
rules of ρ^{¬} to R ∈ L are such that R ≽_K R'. This can be proved intuitively
by observing that they act only on body(R). Thus condition (i) of Definition 4 is
satisfied. Furthermore, it is straightforward to notice that the application of any
of the rules of ρ^{¬} to R reduces the number of models of R. In particular, as for
⟨SpecOntoLit_B⟩, this intuition follows from the semantics of DLs. So condition (ii)
also is fulfilled.

Example 7 

With reference to Example 4, applying the refinement rule ⟨AddDataLit_B+⟩ to

R_0^happy   happy(X) ←

produces R_1^happy, which can be further specialized into R_2^happy, R_3^happy, and
R_4^happy by means of ⟨AddOntoLit_B⟩. Note that no other refinement rule can
be applied to R_1^happy and that R_4^happy can be also obtained as refinement via
⟨SpecOntoLit_B⟩ from R_3^happy.
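
The refinement steps of this example can be mimicked by a small generator of candidate rules. The sketch below rests on several assumptions that are not in the original text: literals are encoded as (predicate, arguments) pairs with fixed variable names, the pool of candidate literals is hard-coded from Example 4, the subsumption hierarchy is reduced to the single axiom [A2], negative literals are omitted, and Datalog safeness is approximated by requiring every head variable to occur in a positive Datalog literal. The names `refine`, `is_safe` and `is_subsumed` are hypothetical.

```python
# A literal is (predicate, args); a rule is (head, frozenset of body literals).
DATA_LITERALS = [("famous", ("X",))]                         # candidates from P_D^+
ONTO_LITERALS = [("RICH", ("X",)), ("LOVES", ("Y", "X")),
                 ("WANTS-TO-MARRY", ("Y", "X"))]             # candidates from P_C, P_R
SUBSUMED_BY = {"WANTS-TO-MARRY": {"LOVES"}}                  # axiom [A2]


def is_subsumed(pred, other):
    """True if pred is subsumed by other (reflexively) in the toy hierarchy."""
    return pred == other or other in SUBSUMED_BY.get(pred, set())


def is_safe(head, body):
    """Simplified safeness: head variables occur in a positive Datalog literal."""
    data_preds = {p for p, _ in DATA_LITERALS}
    bound = {v for p, args in body if p in data_preds for v in args}
    return set(head[1]) <= bound


def refine(rule):
    """One-step downward refinements of a rule (positive literals only)."""
    head, body = rule
    out = set()
    for lit in DATA_LITERALS:                                # AddDataLit_B+
        if lit not in body:
            out.add((head, body | {lit}))
    for pred, args in ONTO_LITERALS:                         # AddOntoLit_B
        if not any(is_subsumed(pred, p) and args == a for p, a in body):
            out.add((head, body | {(pred, args)}))
    for pred, args in body:                                  # SpecOntoLit_B
        for child, parents in SUBSUMED_BY.items():
            if pred in parents:
                out.add((head, (body - {(pred, args)}) | {(child, args)}))
    return {r for r in out if is_safe(*r)}


HEAD = ("happy", ("X",))
r0 = (HEAD, frozenset())
r1 = (HEAD, frozenset({("famous", ("X",))}))
print(len(refine(r0)), len(refine(r1)))
```

Under these assumptions the empty-body rule has a single safe refinement (adding famous(X)), and that rule in turn has three refinements, one per ontology literal, which matches the chain described in the example.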






F.A. Lisi 



NMLEARN-DL+log^{¬}(L, B, O, p)

1. H ← ∅;

2. E+ ← {o_i ∈ O | o_i is true for p};

3. E− ← {o_i ∈ O | o_i is false for p};

4. while E+ ≠ ∅ do

5.    R ← {p(X) ←};

6.    E−_R ← E−;

7.    while E−_R ≠ ∅ do

8.       Q ← {R' ∈ L | R' ∈ ρ^{¬}(R)};

9.       R ← best_of(Q);

10.      E−_R ← E−_R \ {e ∈ E−_R | B ∪ R ⊭ e};

11.   endwhile

12.   H ← H ∪ {R};

13.   E+ ← E+ \ {e ∈ E+ | B ∪ R ⊨ e};

14. endwhile
return H

Figure 2. Main procedure of NMLEARN-DL+log^{¬}

Ideal refinement operators have been proven not to exist for clausal languages
ordered by θ-subsumption or stronger orders but can be approximated by dropping
the requirement of properness or by bounding the language (Nienhuys-Cheng and de Wolf 1997).
We choose the latter option because it guarantees that, if (L, ≽) is a quasi-ordered
set, L is finite and ≽ is decidable, then there always exists an ideal refinement
operator for (L, ≽). In our case, since ≽_K is a decidable quasi-order for any DL with
decidable Boolean CQ/UCQ containment problem, we only need to bound L in a
suitable manner. From Definition 5 we know that the alphabets P_C(L), P_R(L), and
P_D^+(L) ∪ P_D^-(L) are finite. Having Datalog as basis for the CL part of DL+log^{¬}
avoids the generation of infinite terms. Yet, the expressive power of DL+log^{¬}
requires several other bounds to be imposed on L in order to guarantee its finiteness.
It is necessary to introduce a complexity measure for DL+log^{¬} rules, as a pair
of two different coordinates. Considering that the complexity of a DL+log^{¬} rule
resides in its body, the former coordinate is the size (i.e. the difference between the
number of symbol occurrences and the number of distinct variables) of the biggest
literal in body(R), while the latter is the number of literals in body(R). To keep L
finite, we need first to set a maximum value for these two coordinates. Second, it is
necessary to set the maximum number of specialization/generalization steps of the
DL literals so that the search in the ontology is also depth-bounded.

4.3 An algorithm

The algorithm in Figure 2 defines the main procedure of NMLEARN-DL+log^{¬}.
Notice that the outer loop (4-14) corresponds to a variant of the sequential covering
algorithm, i.e., it learns new rules one at a time, removing the positive examples
covered by the latest rule before attempting to learn the next rule (13). The
hypothesis space search performed by NMLEARN-DL+log^{¬} is best understood by
viewing it hierarchically. Each iteration through the outer loop (4-14) adds a new
rule to its disjunctive hypothesis H. The effect of each new rule is to generalize the
current disjunctive hypothesis (i.e., to increase the number of instances it classifies
as positive), by adding a new disjunct. Viewed at this level, the search is a
bottom-up search through the space of hypotheses, beginning with the most specific empty
disjunction (1) and terminating when the hypothesis is sufficiently general to cover
all positive training examples (14). The inner loop (7-11) performs a finer-grained
search to determine the exact definition of each new rule. This loop searches a
second hypothesis space, consisting of conjunctions of literals, to find a conjunction
that will form the preconditions for the new rule. Within this space, it conducts
a top-down, hill-climbing search, beginning with the most general preconditions
possible (5), then adding literals one at a time to specialize the rule (7) until it
avoids all negative examples. To select the most promising specialization from the
candidates generated at each step (9), NMLEARN-DL+log^{¬} considers the
performance of the rule over the training examples, i.e. it maximizes the number of
positive examples covered while keeping the number of negative examples covered
as low as possible.
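
The two nested loops can be sketched in a propositional toy setting. The code below is an illustration of the sequential covering scheme only, not of the algorithm itself: the entailment-based coverage test is replaced by a check on Boolean features, a rule body is a set of feature names, refinement simply adds one feature, and the scoring function (covered positives minus covered negatives) is an assumed stand-in for the selection criterion. The data and all function names are hypothetical.

```python
def covers(rule, example):
    """Stand-in for the entailment-based coverage test: all body features hold."""
    return all(example[f] for f in rule)


def refine(rule, features):
    """Downward refinement: add one feature not yet in the body."""
    return [rule | {f} for f in features if f not in rule]


def learn_one_rule(pos, neg, features):
    """Inner loop: specialize the most general rule until no negative is covered."""
    rule = frozenset()                       # empty body: covers everything
    covered_neg = list(neg)
    while covered_neg:
        candidates = refine(rule, features)
        if not candidates:
            return None                      # bounded language exhausted
        rule = max(candidates,
                   key=lambda r: (sum(covers(r, e) for e in pos)
                                  - sum(covers(r, e) for e in covered_neg)))
        covered_neg = [e for e in covered_neg if covers(rule, e)]
    return rule


def sequential_covering(pos, neg, features):
    """Outer loop: add rules until every positive example is covered."""
    hypothesis, remaining = [], list(pos)
    while remaining:
        rule = learn_one_rule(remaining, neg, features)
        if rule is None or not any(covers(rule, e) for e in remaining):
            break                            # no useful rule can be found
        hypothesis.append(rule)
        remaining = [e for e in remaining if not covers(rule, e)]
    return hypothesis


# Hypothetical toy data: two positive examples and one negative example.
features = ["famous", "rich", "loved"]
pos = [{"famous": True, "rich": True, "loved": True},
       {"famous": True, "rich": True, "loved": False}]
neg = [{"famous": True, "rich": False, "loved": False}]
print(sequential_covering(pos, neg, features))
```

In this toy run a single rule with body {rich} covers both positives and excludes the negative, so the outer loop stops after one iteration. The early exits guard against the case where the bounded language contains no rule that separates the examples.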

Example 8 

With reference to Example 7 and Example 5, we suppose that
= {opaull 

The outer loop of the algorithm NMLEARN-DL+log^{¬} starts from:

R_0^happy   happy(X) ←

which is further refined through the iterations of the inner loop; more precisely, it
is first specialized into:

R_1^happy   happy(X) ← famous(X)

which in turn, since it covers negative examples, is then specialized into:

R_2^happy   happy(X) ← famous(X), RICH(X)

R_3^happy   happy(X) ← famous(X), LOVES(Y,X)

R_4^happy   happy(X) ← famous(X), WANTS-TO-MARRY(Y,X)

out of which the rule R_3^happy is selected as the best and added to the hypothesis
because it does not cover negative examples. Note that R_3^happy is preferred to
R_4^happy because it is more general.



5 Inducing Database Constraints in DL+log^{¬∨} with ILP

In this section we face the problem of inducing an integrity theory H for a database
Π whose instance Π_F is given and whose schema K encompasses an ontology Σ and
a set Π_R of rules linking the database to the ontology. We assume that Π and Σ
share a common set of constants so that they can constitute a DL+log^{¬∨} KB B.
Definition 6
Given:

• an intensional Datalog database Π_R and a DL ontology Σ integrated into
a DL+log^{¬∨} KB K (background theory);

• a set O = Π_F of ground Datalog facts (observation); and

• a set L of constraints on the form of DL+log^{¬∨} rules to be induced (language
of hypotheses)

the problem of defining an integrity theory for Π_F is to induce a set H ⊂ L
(hypothesis) of DL+log^{¬∨} rules from O and K such that H confirms O by taking K
into account.

Note that, as opposed to the learning problem formally stated in Definition 2, the
background theory in Definition 6 is a DL+log^{¬∨} KB K which does not include
the extensional part Π_F of the database. Indeed Π_F plays the role of the unique
observation from which the learning process should induce a theory H. Conversely,
similarly to Section 4, we denote by P_C(B), P_R(B), and P_D(B) the sets of concept,
role and Datalog predicate names occurring in B, respectively, assuming that
B = Σ ∪ Π.

Example 9 

Throughout this section we shall refer to a database about students in the form
of a Datalog^{¬∨} program Π which consists of an extensional part Π_F with the
following facts:

boy(Paul)
girl(Mary)
enrolled(Paul, c1)
enrolled(Mary, c1)
enrolled(Mary, c2)
enrolled(Bob, c3)

and an intensional part Π_R with the following rules:

[R1] FEMALE(X) ← girl(X)
[R2] MALE(X) ← boy(X)

linking the database to an ontology about persons expressed as the following DL
KB Σ:

[A1] PERSON ⊑ ∃FATHER⁻.MALE
[A2] MALE ⊑ PERSON
[A3] FEMALE ⊑ PERSON
[A4] FEMALE ⊑ ¬MALE

MALE(Bob)

PERSON(Mary)

PERSON(Paul)

Note that Π and Σ can be integrated into a DL+log^{¬∨} KB B (adapted from
(Rosati 2006)) that concerns the individuals Bob, Mary, and Paul and builds upon
the alphabets P_C(B) = {FEMALE/1, MALE/1, PERSON/1}, P_R(B) = {FATHER/2}, and
P_D(B) = {boy/1, girl/1, enrolled/2}.

The language L of hypotheses in Definition 6 must allow for the generation of
DL+log^{¬∨} rules starting from three disjoint alphabets P_C(L) ⊆ P_C(B), P_R(L) ⊆
P_R(B), and P_D(L) ⊆ P_D(B). Analogously to Section 4, we distinguish between
P_D^+(L) and P_D^-(L) in order to specify which Datalog predicates can occur in
positive and negative literals, respectively.

Example 10 

The following DL+log^{¬∨} rules:

PERSON(X) ← enrolled(X,c1)
boy(X) ∨ girl(X) ← enrolled(X,c1)
← enrolled(X,c2), MALE(X)
← enrolled(X,c2), not girl(X)
MALE(X) ← enrolled(X,c3)

belong to the language L built upon the alphabets P_C(L) = P_C(B), P_R(L) = ∅,
P_D^+(L) = {boy/1, girl/1, enrolled(_,c1), enrolled(_,c2), enrolled(_,c3)},
and P_D^-(L) = {boy/1, girl/1}.

The scope of induction in the learning problem of interest is characterization
because we are looking for a theory which confirms the observation. Also, since a
DL+log^{¬∨} KB may be incomplete due to the inherent nature of this KR framework,
the most appropriate setting for induction is the one for learning from entailment.
The coverage test proposed in the following generalizes the case illustrated
in Definition 3 to observations which are not singletons of facts.

Definition 7

Let R ∈ L be a DL+log^{¬∨} rule, K a DL+log^{¬∨} KB, and O = {p_i(a_i)} a set
of ground Datalog facts. We say that R covers O under entailment w.r.t. K iff
K ∪ R ⊨ ⋀_i p_i(a_i).

It is immediate to notice that the coverage test of Definition 7 can be reduced to
Boolean CQ answering in DL+log^{¬∨} KBs and therefore to an NM-satisfiability
problem.

In the following we sketch the ingredients for an ILP system able to discover such
integrity theories on the basis of NMSAT-DL+log.

5.1 The hypothesis space 

The order of relative subsumption (Plotkin 1971) is suitable for extension to DL+log^{¬∨}
rules because it can cope with arbitrary clauses and admit an arbitrary finite set
of clauses as the background theory.
Definition 8

Let R_1, R_2 ∈ L be two DL+log^{¬∨} rules, and K a DL+log^{¬∨} KB. We say that
R_1 is more general than R_2 w.r.t. K, denoted by R_1 ≽_K^{¬∨} R_2, if there exists a
substitution θ such that K ⊨ ∀(R_1θ → R_2). We say that R_1 is strictly more general
than R_2 w.r.t. K, denoted by R_1 ≻_K^{¬∨} R_2, iff R_1 ≽_K^{¬∨} R_2 and R_2 ⋡_K^{¬∨} R_1. We say
that R_1 is equivalent to R_2 w.r.t. K, denoted by R_1 ≡_K^{¬∨} R_2, iff R_1 ≽_K^{¬∨} R_2 and
R_2 ≽_K^{¬∨} R_1.

Example 11 

Let us consider the following DL+log^{¬∨} rules belonging to the language L specified
in Example 10:

R_1   boy(X) ← enrolled(X,c1)

R_2   boy(A) ∨ girl(A) ← enrolled(A,c1)

It can be easily proved that R_1 ≽_K^{¬∨} R_2. Let θ = {X/A} be the substitution to be
applied to R_1 and let us suppose that, for every A, if A is enrolled in the course c1,
then A is a boy (i.e. the rule R_1θ is true), thus we can also say that A is either a
boy or a girl (i.e. the rule R_2 is true). Note that R_2 ⋡_K^{¬∨} R_1.

Let us now consider the following DL+log^{¬∨} rules also belonging to L:

R_3   MALE(X) ← enrolled(X,c1)

R_4   PERSON(A) ← enrolled(A,c1)

In order to prove that R_3 ≽_K^{¬∨} R_4, we apply θ = {X/A} to R_3 and suppose that,
for every A, if A is enrolled in the course c1, then A is a MALE (i.e. the rule R_3θ is
true). Due to axiom [A2] occurring in the ontology Σ reported in Example 9, A is
a PERSON (i.e. the rule R_4 is true). It is immediate to verify that R_3 ≻_K^{¬∨} R_4.

The generality relation defined by ≽_K^{¬∨} is a quasi-order on DL+log^{¬∨} rules,
therefore the resulting space (L, ≽_K^{¬∨}) can be searched by means of refinement
operators.

5.2 The refinement operator 

A refinement operator for (L, ≽_K^{¬∨}) should generate DL+log^{¬∨} rules good at
expressing integrity constraints. Since we assume the database Π and the ontology Σ
to be correct, a rule R must be modified to make it satisfiable by Π ∪ Σ by either
(i) strengthening body(R) or (ii) weakening head(R).

Definition 9 

Let L be a DL+log^{¬∨} language of hypotheses built out of the three finite and
disjoint alphabets P_C(L), P_R(L), and P_D^+(L) ∪ P_D^-(L), and

p_1(X_1) ∨ ... ∨ p_n(X_n) ←
r_1(Y_1), ..., r_m(Y_m), s_1(Z_1), ..., s_k(Z_k), not u_1(W_1), ..., not u_h(W_h)

be a rule R belonging to L. We define a downward refinement operator ρ^{¬∨} for
(L, ≽_K^{¬∨}) such that the set ρ^{¬∨}(R) contains all R' ∈ L that can be obtained from R
by applying one of the following refinement rules:

⟨AddDataLit_B+⟩ body(R') = body(R) ∪ {r_{m+1}(Y_{m+1})} if

1. r_{m+1} ∈ P_D^+(L)

2. r_{m+1}(Y_{m+1}) ∉ body(R)

⟨AddDataLit_B−⟩ body(R') = body(R) ∪ {not u_{h+1}(W_{h+1})} if

1. u_{h+1} ∈ P_D^-(L)

2. u_{h+1}(W_{h+1}) ∉ body(R)

⟨AddOntoLit_B⟩ body(R') = body(R) ∪ {s_{k+1}(Z_{k+1})} if

1. s_{k+1} ∈ P_C(L) ∪ P_R(L)

2. it does not exist any s_l ∈ body(R) such that s_{k+1} ⊑ s_l

⟨SpecOntoLit_B⟩ body(R') = (body(R) \ {s_l(Z_l)}) ∪ s'_l(Z_l) if

1. s'_l ∈ P_C(L) ∪ P_R(L)

2. s'_l ⊑ s_l

⟨AddDataLit_H⟩ head(R') = head(R) ∪ {p_{n+1}(X_{n+1})} if

1. p_{n+1} ∈ P_D^+(L)

2. p_{n+1}(X_{n+1}) ∉ head(R)

⟨AddOntoLit_H⟩ head(R') = head(R) ∪ {p_{n+1}(X_{n+1})} if

1. p_{n+1} ∈ P_C(L) ∪ P_R(L)

2. it does not exist any p_l ∈ head(R) such that p_{n+1} ⊑ p_l

⟨GenOntoLit_H⟩ head(R') = (head(R) \ {p_l(X_l)}) ∪ p'_l(X_l) if

1. p'_l ∈ P_C(L) ∪ P_R(L)

2. p_l ⊑ p'_l

Note that, since we are working under NM-semantics, two distinct rules, namely
⟨AddDataLit_B−⟩ and ⟨AddDataLit_H⟩, are devised for adding negated Datalog
atoms to the body and for adding Datalog atoms to the head, respectively. It can
be proved that all the rules of ρ^{¬∨} are correct, i.e. the R's obtained by applying
any of the rules of ρ^{¬∨} to R ∈ L are such that R ≽_K^{¬∨} R'. Intuitively, it is sufficient
to observe that the application of any of the rules of ρ^{¬∨} conceived to strengthen
body(R) reduces the number of models of R whereas the rules aiming at weakening
head(R), when applied, do not augment the number of models of R.

Example 12 

From the rule belonging to the language L specified in Example 10

← enrolled(X,c1)

we obtain the following rules by applying ⟨AddDataLit_B+⟩:

← enrolled(X,c1), boy(X)
← enrolled(X,c1), girl(X)
← enrolled(X,c1), enrolled(X,c2)
← enrolled(X,c1), enrolled(X,c3)
NMDISC-DL+log^{¬∨}(L, K, Π_F)

1. H ← ∅

2. Q ← {□}

3. while Q ≠ ∅ do

4.    Q ← Q \ {R};

5.    if NMSAT-DL+log(K ∪ Π_F ∪ H ∪ {R})

6.    then H ← H ∪ {R}

7.    else Q ← Q ∪ {R' ∈ L | R' ∈ ρ^{¬∨}(R)}

8.    endif

9. endwhile
return H

Figure 3. Main procedure of NMDISC-DL+log^{¬∨}

the following ones by applying ⟨AddDataLit_B−⟩:

← enrolled(X,c1), not boy(X)
← enrolled(X,c1), not girl(X)

the following ones by applying ⟨AddOntoLit_B⟩:

← enrolled(X,c1), PERSON(X)
← enrolled(X,c1), FEMALE(X)
← enrolled(X,c1), MALE(X)

the following ones by applying ⟨AddDataLit_H⟩:

boy(X) ← enrolled(X,c1)
girl(X) ← enrolled(X,c1)
enrolled(X,c2) ← enrolled(X,c1)
enrolled(X,c3) ← enrolled(X,c1)

and the following ones:

PERSON(X) ← enrolled(X,c1)
FEMALE(X) ← enrolled(X,c1)
MALE(X) ← enrolled(X,c1)

by applying ⟨AddOntoLit_H⟩.

5.3 The algorithm 

The integrity theory H we would like to discover is a set of DL+log^{¬∨} rules. It
must be induced by taking the background theory K = Σ ∪ Π_R into account so
that B = (Σ, Π ∪ H) is an NM-satisfiable DL+log^{¬∨} KB. The algorithm in Figure 3
defines the main procedure of NMDISC-DL+log^{¬∨}: it starts from an empty theory
H (1), and a queue Q containing only the empty clause (2). It then applies a search
process (3) where each element R is deleted from the queue Q (4), and tested for
satisfaction w.r.t. the data Π_F by taking into account the background theory K and
the current integrity theory H (5). Note that the NM-satisfiability test includes also
the current induced theory in order to deal with the nonmonotonicity of induction
in the normal ILP setting. If the rule R is satisfied by the database (5), it is added to
the theory (6). If the rule is violated by the database, its refinements according to L
are considered (7). The search process terminates when Q becomes empty (9). Note
that the algorithm does not specify the search strategy. In order to get a minimal
theory (i.e., without redundant clauses), a pruning step and a post-processing phase
can be added to NMDISC-DL+log^{¬∨} by further calling NMSAT-DL+log*.
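
The queue-based search can be sketched over a purely propositional stand-in for the database. In the code below, which is an illustration under stated assumptions rather than the procedure itself, each individual is described by the set of unary properties it satisfies, a clause is a pair (head, body) of property sets read as "head disjunction ← body conjunction", and the NM-satisfiability test of step (5) is replaced by a direct check of the clause against every individual. The duplicate filter `seen` and the bound `max_literals` are additions needed to keep the toy search finite; the data and function names are hypothetical.

```python
from collections import deque


def satisfied(clause, individuals):
    """Every individual that meets the whole body meets at least one head atom."""
    head, body = clause
    return all(any(h in props for h in head)
               for props in individuals.values() if body <= props)


def refine(clause, predicates):
    """Add an unused predicate to the body (strengthen) or to the head (weaken)."""
    head, body = clause
    used = head | body
    for p in predicates:
        if p not in used:
            yield (head, body | {p})
            yield (head | {p}, body)


def discover(individuals, predicates, max_literals=2):
    """Breadth-first search for clauses that hold in the data."""
    theory, seen = [], set()
    queue = deque([(frozenset(), frozenset())])      # start from the empty clause
    while queue:
        clause = queue.popleft()
        if satisfied(clause, individuals):
            theory.append(clause)                    # keep it, do not refine further
        elif len(clause[0]) + len(clause[1]) < max_literals:
            for r in refine(clause, predicates):
                if r not in seen:
                    seen.add(r)
                    queue.append(r)
    return theory


# Hypothetical toy data: properties holding for each individual.
individuals = {"Paul": {"boy", "c1"}, "Mary": {"girl", "c1"}}
theory = discover(individuals, ["boy", "girl", "c1"])
for head, body in theory:
    print(sorted(head), "<-", sorted(body))
```

Clauses that are still violated when they reach the literal bound are simply dropped. Unlike the procedure in the figure, this sketch does not re-test candidates against the theory found so far and performs no pruning of redundant clauses.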

Example 13 

With reference to Example 12, the following DL+log^{¬∨} rule:

PERSON(X) ← enrolled(X,c1)

is the only one passing the NM-satisfiability test at step (5) of the algorithm
NMDISC-DL+log^{¬∨}. It is added to the integrity theory. All the other rules are
further refined. When the learning process ends at step (9) because the queue of
rules has become empty, the integrity theory will encompass the rules reported in
Example 10 because they are satisfied by the database.

6 Related Work 

Very few ILP frameworks have been proposed so far that adopt a hybrid DL-CL
representation for both hypotheses and background knowledge (Rouveirol and Ventos 2000;
Kietz 2003; Lisi 2008; Lisi and Esposito 2008). They are less or differently
expressive than the one presented in this paper.

The framework proposed in (Rouveirol and Ventos 2000) focuses on discriminant
induction and adopts the ILP setting of learning from interpretations. Hypotheses
are represented as Carin-ALN non-recursive rules with a Horn literal in the head
that plays the role of target concept. The coverage relation of hypotheses against
examples adapts the usual one in learning from interpretations to the case of hybrid
Carin-ALN BK. The generality relation for hypotheses is defined as an extension
of generalized subsumption. Procedures for testing both the coverage relation and
the generality relation are based on the existential entailment algorithm of Carin.
Following (Rouveirol and Ventos 2000), Kietz studies the learnability of Carin-ALN,
thus providing a pre-processing method which enables ILP systems to learn
Carin-ALN rules (2003).

In (Lisi 2008), the representation and reasoning means come from AL-log.
Hypotheses are represented as constrained Datalog clauses. Note that this framework
is general, meaning that it is valid whatever the scope of induction is. The
generality relation for one such hypothesis language is an adaptation of generalized
subsumption to the AL-log KR framework. It gives rise to a quasi-order and can be
checked with a decidable procedure based on constrained SLD-resolution. Coverage
relations for both ILP settings of learning from interpretations and learning from
entailment have been defined on the basis of query answering in AL-log. As opposed
to (Rouveirol and Ventos 2000), the framework has been partially implemented in
an ILP system (Lisi and Malerba 2004) that supports a variant of frequent pattern
discovery where rich prior conceptual knowledge is taken into account in order to
find patterns at multiple levels of description granularity.

* Based on the following consequence of the Deduction Theorem in FOL: Given a KB B and a
rule R in DL+log^{¬∨}, we have that B ⊨ R iff B ∧ ¬R is unsatisfiable.

The framework presented in (Lisi and Esposito 2008) is the closest to the present
work. Indeed it faces the problem of learning in DL+log, i.e. by disregarding
the NM features of DL+log¬∨. Yet the framework is more general than the one
illustrated here because two cases of rule learning are considered, one aimed at
inducing rules with one Datalog literal in the head and the other rules with one
DL literal in the head. The former kind of rule will enrich the Datalog part of
the KB, whereas the latter will extend the DL part.

The main procedure of NMLEARN-DL+log¬ follows the principles of FOIL
(Quinlan 1990) but shows some peculiarities due to the nature of the underlying KR
framework, e.g. the setting of learning from entailment (which is more powerful than
the use of extensional background theory and coverage testing), and the ordering
of generalized subsumption (instead of θ-subsumption).

The main procedure of NMDISC-DL+log¬∨ is inspired by CLAUDIEN (De Raedt and
Bruynooghe 1993) as for the scope of induction and the algorithm scheme but
differs from it in several points, notably the adoption of (i) relative
subsumption instead of θ-subsumption, (ii) stable model semantics instead of
completion semantics, and (iii) learning from entailment instead of learning from
interpretations, to deal properly with the chosen representation formalism for
both the background theory and the language of hypotheses.

ILP has also been applied to data engineering tasks such as the interactive
restructuring of databases, giving rise to the so-called Inductive Data Engineering
(IDE) (Flach 1993; Flach 1998; Savnik and Flach 2000). The main idea is to use
induction to determine integrity constraints, such as functional and multivalued
dependencies, that are valid (or almost valid) in a database and then use the
constraints to decompose (restructure) the database.
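
To make the notion of a dependency being "valid (or almost valid)" concrete,
the following minimal sketch checks a candidate functional dependency against a
table. It is not taken from the IDE work cited above: the function names
fd_error and fd_holds, the tolerance parameter and the toy enrolment table are
all assumptions introduced for illustration. The error is measured here as the
fraction of rows that would have to be removed for the dependency to hold
exactly.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Sequence

Row = Dict[str, Hashable]


def fd_error(rows: Sequence[Row], lhs: Sequence[str], rhs: Sequence[str]) -> float:
    """Fraction of rows to drop so that the dependency lhs -> rhs holds exactly."""
    if not rows:
        return 0.0
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[a] for a in lhs)   # value of the determining attributes
        val = tuple(row[a] for a in rhs)   # value of the determined attributes
        groups[key][val] += 1
    # Within each lhs group, keep only the rows agreeing with the majority value.
    kept = sum(max(counts.values()) for counts in groups.values())
    return (len(rows) - kept) / len(rows)


def fd_holds(rows, lhs, rhs, tolerance: float = 0.0) -> bool:
    """Exactly valid when tolerance is 0.0, 'almost valid' for a small tolerance."""
    return fd_error(rows, lhs, rhs) <= tolerance


# Hypothetical toy table, not data from the paper.
rows = [
    {"student": "s1", "course": "c1", "teacher": "t1"},
    {"student": "s2", "course": "c1", "teacher": "t1"},
    {"student": "s3", "course": "c2", "teacher": "t2"},
    {"student": "s4", "course": "c2", "teacher": "t3"},
]

print(fd_holds(rows, ["student"], ["course"]))   # exact check
print(fd_error(rows, ["course"], ["teacher"]))   # degree of violation
```

A dependency accepted by such a check could then drive a decomposition of the
table, which is the restructuring step mentioned above; that step is not shown.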



7 Conclusions and Future Work 

In this paper, we have investigated two ILP solutions for learning in the KR
framework of DL+log¬∨, both valid for any DL for which the instantiation of the
framework is decidable, but one restricted to Datalog¬ and the other for the full
framework. Indeed, well-known ILP techniques for induction such as the orderings
of generalized subsumption and relative subsumption have been reformulated in
terms of the deductive reasoning mechanisms of DL+log¬∨, namely by relying on
the algorithm NMSAT-DL+log devised to prove NM-satisfiability of DL+log¬∨
KBs. Notably, we have defined generality orders, refinement operators and coverage
tests on the basis of NMSAT-DL+log. Though the work presented in this paper
is not yet supported by empirical evidence, it shows that it is feasible for ILP to
go beyond Datalog towards DL+log¬∨. The potential of this extended ILP has
been illustrated in two traditional database problems, i.e. the definition of views
and the definition of integrity theories, for which we have sketched ad-hoc ILP
algorithms, NMLEARN-DL+log¬ and NMDISC-DL+log¬∨, respectively. The NM
features as well as the DL component of DL+log¬∨ enable these algorithms to
build hypotheses with expressiveness far greater than the one reachable with the
predecessors FOIL and CLAUDIEN. Notably, ontologies fit elegantly into
the solution to the database problems being considered. From the ILP viewpoint the
expressive power of DL+log¬∨ has, of course, raised some technical difficulties.
In particular, the critical point has been the DL component, which has required an
appropriate treatment when defining both the generality orders and the refinement
operators. Also, the setting of learning from entailment turned out to be the most
appropriate for induction within the DL+log¬∨ KR framework.

As a next step towards practical use, we plan to first analyze the complexity and
then produce an efficient and scalable implementation of these ILP algorithms.
Adopting less expressive but tractable instantiations of DL+log¬∨ may turn out
to be crucial from this point of view. E.g., DL-Lite (Calvanese et al. 2007) has
been proved to be good at making DL+log¬∨ practically useful (Rosati 2006).
Another point is the definition of so-called optimal refinement operators to be
actually employed in NMLEARN-DL+log¬ and NMDISC-DL+log¬∨. Indeed, ideal
refinement operators are mainly of theoretical interest, because in practice they are
often very inefficient. More constructive (though possibly improper) refinement
operators are usually to be preferred over ideal ones. Optimal refinement operators
can be easily derived from those proposed in this paper.

Learning in DL+log¬∨ is also promising for Semantic Web applications for the
following reasons. First, it can deal with ontologies almost as expressive as the ones
that OWL allows. Indeed, as already mentioned, SHIQ has been the starting point
for the definition of OWL and gives rise to one of the currently most expressive
decidable instantiations of DL+log¬∨. Second, it can deal with incomplete
knowledge thanks to the NM features of DL+log¬∨. Third, it can deal with ontologies
and rules tightly integrated as devised by the W3C Rule Interchange Format (RIF)
working group. Indeed the activity of the RIF group concerns (i) the definition of
a core language with extensions some of which (the nonmonotonic ones) will most
likely be inspired by hybrid DL-CL languages like DL+log¬∨ and (ii) the
identification of use cases, many of which are suitable applications for our
algorithms.

As a final remark, we would like to point out that the shift from Datalog to
DL+log¬∨ in ILP paves the way to an extension of Relational Learning (and
Data Mining), named Onto-Relational Learning, which accounts for ontologies in a
clear, well-founded and systematic way. Following the work reported in this paper,
we can build new-generation ILP systems able to learn from relational databases
integrated with ontologies according to the principles of Onto-Relational Learning.